Abstract. Consider a random binary sequence $X^{(n)}$ of random variables $X_t$,
$t = 1, 2, \ldots, n$, for instance, one that is generated by a Markov source (teacher)
of order $k^*$ (each state represented by $k^*$ bits). Assume that the probability
of the event $X_t = 1$ is constant and denote it by $\beta$. Consider a learner which
is based on a parametric model, for instance a Markov model of order $k$, who
trains on a sequence $x^{(m)}$ which is randomly drawn by the teacher. Test the
learner's performance by giving it a sequence $x^{(n)}$ (generated by the teacher)
and check its predictions on every bit of $x^{(n)}$. An error occurs at time $t$ if
the learner's prediction $Y_t$ differs from the true bit value $X_t$. Denote by $\xi^{(n)}$
the sequence of errors where the error bit $\xi_t$ at time $t$ equals 1 or 0 according
to whether the event of an error occurs or not, respectively. Consider the
subsequence $\xi^{(\nu)}$ of $\xi^{(n)}$ which corresponds to the errors of predicting a 0, i.e.,
$\xi^{(\nu)}$ consists of the bits of $\xi^{(n)}$ only at times $t$ such that $Y_t = 0$. In this paper
we compute an estimate on the deviation of the frequency of 1s of $\xi^{(\nu)}$ from
$\beta$. The result shows that the level of randomness of $\xi^{(\nu)}$ decreases relative to
an increase in the complexity of the learner.

1. Introduction 

Let $X^{(n)} = X_1, \ldots, X_n$ be a sequence of binary random variables drawn
according to some unknown joint probability distribution $P(X^{(n)})$. Consider the problem
of learning to predict the next bit in a binary sequence drawn according to $P$. For
training, the learner is given a finite sequence $x^{(m)}$ of bits $x_i \in \{0, 1\}$, $1 \le i \le m$,
drawn according to $P$ and estimates a model $\mathcal{M}$ that can be used to predict the
next bit of a partially observed sequence. After training, the learner is tested on
another sequence $x^{(n)}$ drawn according to the same unknown distribution $P$. Using
$\mathcal{M}$ he produces the bit $y_t$ as a prediction for $x_t$, $1 \le t \le n$. Denote by $\xi^{(n)}$ the
corresponding binary sequence of mistakes where $\xi_t = 1$ if $y_t \ne x_t$ and is 0 otherwise.
We pose the following main question: how random is $\xi^{(n)}$?
It is clear that the sequence of mistakes should be random since the test
sequence $x^{(n)}$ is random. It may also be that, because the learner is using a model
of a finite structure (or a finite description-length), it somehow introduces
dependencies and causes $\xi^{(n)}$ to be less random than $x^{(n)}$. And yet by another
intuition, perhaps the fact that the learner is of a finite complexity limits its ability
to 'deform' (or distort) randomness of $x^{(n)}$? These are all valid initial guesses
that relate to this main question. We note that our basis for saying that $\mathcal{M}$ has
a finite structure stems from it being an element of some regular hypothesis class,
for instance, having a finite VC-dimension as is often the case in a learning setting
(see for instance structural risk minimization of [2^). In the current paper, we are
not interested in the learner's performance (as modeled for instance by Valiant's
PAC framework) but instead we take a data-centric view and ask how much
influence the learner has on the stochastic properties of the errors. We view the
learner as an entity that 'interferes' with the randomness that is inherent in the
sequence to be predicted and through his predictions creates a sequence of mistakes
that has a different stochastic character. This view in a broader sense is taken in
[13] and is shown (empirically) to explain how static structures may 'deform'
random external forces.

To answer the main question raised above we build on a specific learning
setting and use it for our analysis. In this setting we consider a teacher that uses a
probability distribution $P$ based on a Markov model with a certain complexity. The
learner has access to a hypothesis class of Boolean decision rules that are also based
on Markov models. Hence, learning amounts to the estimation of parameters of a
finite-order Markov model (see for instance [7, 14]). Many different settings can be
analyzed, in particular, more general ones. For instance, one may consider
a learner that, in addition to parametric estimation, does statistical model selection.

The remainder of the paper is organized as follows: in section 2 we give a brief
introduction to the notion of randomness, in section 3 we define the problem and
state our result, and in section 4 we prove the theorem.

2. Randomness of a finite sequence 

The notion of randomness of finite objects (binary sequences) aims to explain
the intuitive idea that a sequence, whether finite or infinite, should be measured as
being more unpredictable if it possesses fewer regularities (patterns). There is no
formal definition of randomness but there are three main properties that a random
binary string of length $n$ must intuitively satisfy. The first property is the
so-called stochasticity or frequency stability of the sequence which means that any
binary word of length $k \le n$ must have the same frequency limit (equal to $2^{-k}$).
This is basically the notion of normality that Borel introduced and is related to
the degree of unpredictability of the sequence. The second property is chaoticity
or disorderliness of the sequence. A sequence is less chaotic (less complex) if it
has a short description, i.e., if the minimal length of a program that generates the
sequence is short. The third property is typicalness. A random sequence is a typical
representative of the class of all binary sequences. It has no specific features
distinguishing it from the rest of the population. An infinite binary sequence is
typical if no small subset $E$ of the sequence space contains it (the correct definition
of a 'small' set was given by Martin-Löf).

As mentioned in section 1 our interest in this paper is essentially to ask what
'interference' a learner has on the randomness of a test sequence. It appears
essential that we look not only at the randomness of the object itself (the test
sequence $x^{(n)}$) but also at the interfering entity, the learner, specifically, its
algorithmic component that is used for prediction. Related to this, there is an area
of research that studies algorithmic randomness which is the relationship between
complexity and stochasticity of finite and infinite binary sequences |3!]. Algorithmic
randomness was first considered by von Mises in 1919 who defined an infinite
binary sequence $\alpha$ of zeros and ones as random if it is unbiased, i.e. if the frequency
of zeros goes to 1/2, and every subsequence of $\alpha$ that we can extract using an
admissible selection rule (see definition below) is also not biased. Kolmogorov and
Loveland proposed a more permissive definition of an admissible selection
rule as any (partial) computable process which, having read any $n$ bits of an infinite
binary sequence $\alpha$, picks a bit that has not been read yet, decides whether it should
be selected or not, and then reads its value. When subsequences selected by such
a selection rule pass the unbiasedness test they are called Kolmogorov-Loveland
stochastic (KL-stochastic for short). Martin-Löf introduced a notion of
randomness which is now considered by many as the most satisfactory notion of
algorithmic randomness. His definition says precisely which infinite binary sequences
are random and which are not. The definition is probabilistically convincing in
that it requires each random sequence to pass every algorithmically implementable
statistical test of randomness.

Let us briefly define what is meant by a selection rule. As mentioned above,
this is a principal concept used as part of tests of randomness of sequences. Let
$\{0, 1\}^*$ be the space of all finite binary sequences and denote by $\{0, 1\}^n$ the set of
all finite binary sequences of length $n$. An admissible selection rule $R$ is defined
(0,[23]) based on three partial recursive functions $f$, $g$ and $h$ on $\{0, 1\}^*$. Let $x^{(n)} =
x_1, \ldots, x_n$. The process of selection is recursive. It begins with an empty sequence
$\emptyset$. The function $f$ is responsible for selecting possible candidate bits of $x^{(n)}$ as
elements of the subsequence to be formed. The function $g$ examines the value of
these bits and decides whether to include them in the subsequence. Thus $f$ does
so according to the following definition: $f(\emptyset) = i_1$, and if at the current time
$k$ a subsequence has already been selected which consists of elements $x_{i_1}, \ldots, x_{i_k}$
then $f$ computes the index of the next element to be examined according to
$f(x_{i_1}, \ldots, x_{i_k}) = i$ where $i \notin \{i_1, \ldots, i_k\}$, i.e., the next element to be examined must
not be one which has already been selected (notice that maybe $i < i_j$, $1 \le j \le k$,
i.e., the selection rule can go backwards on $x$). Next, the two-valued function $g$
selects this element $x_i$ to be the next element of the constructed subsequence of
$x$ if and only if $g(x_{i_1}, \ldots, x_{i_k}) = 1$. The role of the two-valued function $h$ is to
decide when this process must be terminated. This subsequence selection process
terminates if $h(x_{i_1}, \ldots, x_{i_k}) = 1$ or $f(x_{i_1}, \ldots, x_{i_k}) > n$. Let $R(x^{(n)})$ denote the
selected subsequence. By $K(R|n)$ we mean the length of the shortest program
computing the values of $f$, $g$ and $h$ given $n$.
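
The selection process described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function name `select_subsequence` and the example rule are hypothetical, the three functions are ordinary (total) Python callables rather than partial recursive functions, and indices are 1-based as in the text.

```python
from typing import Callable, List, Sequence, Tuple

History = Tuple[int, ...]  # values of the bits examined so far, in order


def select_subsequence(
    x: Sequence[int],
    f: Callable[[History], int],
    g: Callable[[History], int],
    h: Callable[[History], int],
) -> List[int]:
    """Run a selection rule (f, g, h) on x and return the selected bits.

    f gives the (1-based) index of the next bit to examine, g decides from
    the history alone whether that bit joins the subsequence, and h decides
    whether to stop. All three see only the values examined so far.
    """
    n = len(x)
    examined_idx: List[int] = []
    examined_val: List[int] = []
    selected: List[int] = []
    while len(examined_idx) < n:
        history = tuple(examined_val)
        if h(history) == 1:
            break
        i = f(history)
        # Stop when f points outside x or at an already examined position.
        if i < 1 or i > n or i in examined_idx:
            break
        if g(history) == 1:
            selected.append(x[i - 1])
        examined_idx.append(i)
        examined_val.append(x[i - 1])
    return selected


# Hypothetical rule: scan left to right and select a bit exactly when the
# previously examined bit was a 1; never stop early.
scan = lambda hist: len(hist) + 1
after_one = lambda hist: 1 if hist and hist[-1] == 1 else 0
never_stop = lambda hist: 0

print(select_subsequence([1, 0, 1, 1, 0, 0, 1], scan, after_one, never_stop))
```

Note that `g` never sees the value of the bit it is deciding on, which is what keeps a selection rule from trivially selecting only the 1s.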

From the previous discussion, we know that there are two principal measures
related to the information content in a finite sequence $x^{(n)}$: stochasticity
(unpredictability) and chaoticity (complexity). An infinitely long binary sequence is
regarded random if it satisfies the principle of stability of the frequency of 1s for any of
its subsequences that are obtained by an admissible selection rule j9||. Kolmogorov
showed that the stochasticity of a finite binary sequence $x$ may be precisely
expressed by the deviation of the frequency of ones from some $0 < p < 1$, for any
subsequence of $x^{(n)}$ selected by an admissible selection rule $R$ of finite complexity
$K(R|n)$. For an object $x$ given another object $y$ he defined in Q the complexity of
$x$ as

$$K(x|y) = \min\{l(\pi) : \phi(\pi, y) = x\} \qquad (2.1)$$

where $l(\pi)$ is the length of the sequence $\pi$, $\phi$ is a universal partial recursive function
which acts as a description method, i.e., when provided with input $(\pi, y)$ it gives
a specification for $x$ (for more on that see the nice paper by [23]). The chaoticity
of $x^{(n)}$ is large if its complexity is close to its length $n$. The classical work of
[H, 0, 13, [l^ relates chaoticity to stochasticity. There it is shown that chaoticity

implies stochasticity. For a binary sequence $s$, let us denote by $\|s\|$ the number of
1s in $s$, then this can be seen from the following relationship (with $p = 1/2$):

$$\left| \frac{\|R(x^{(n)})\|}{l(R(x^{(n)}))} - \frac{1}{2} \right| \le c \sqrt{\frac{n - K(x^{(n)}|n) + K(R|n) + 2 \log K(R|n)}{l(R(x^{(n)}))}} \qquad (2.2)$$

where $l(R(x^{(n)}))$ is the length of the subsequence selected by $R$ and $c > 0$ is some
absolute constant. From this it is apparent that as the chaoticity of $x^{(n)}$ grows the
stochasticity of the selected subsequence $R(x^{(n)})$ grows (the bias from 1/2 decreases).
Also, the information content of the selection rule, namely $K(R|n)$, has a direct
effect on this relationship: the lower $K(R|n)$ the stronger the stability (smaller
deviation of the frequency of 1s from 1/2). In [H| the other direction which shows
that stochasticity implies chaoticity is proved.

So referring back to the initial guesses we made in section 1 concerning our
expectation about the randomness of the error sequence $\xi^{(n)}$, we now have a better
clue and expect that the more algorithmically complex a learner's prediction rule is
the higher the possibility that it distorts (introduces bias into) the randomness of
the test sequence $x^{(n)}$. As will be shown, rather than resorting to algorithmic
randomness theory (which requires dealing with the non-practical and hard to analyze
notion of Kolmogorov complexity) a direct combinatorial approach will do.

3. Problem definition 

Let us denote by $\{0, 1\}^*$ the space of all finite binary sequences. The learning
problem consists of predicting the next bit value in a sequence $X^{(n)} = X_1, X_2, \ldots, X_n$
of binary random variables drawn randomly according to a probability distribution
$P$ which is defined based on a Markov chain with a finite number of states $s$. For
convenience, we let the state space be the set of natural numbers between 0 and
$2^k - 1$ and represent each state $s \in \mathbb{S}_k = \{0, 1, \ldots, 2^k - 1\}$ by its unique binary
vector $b = [b(1), b(2), \ldots, b(k)] \in \{0, 1\}^k$. We alternatively refer to states either by
their decimal number $s$ or their binary vector $b$.

Associated with these states is the transition matrix $T_k$ where the $i$-th row
represents the conditional probability distribution given state $i$. Consider drawing a
random sequence $X^{(n)}$ using the chain by repeatedly making a transition from the
current state $S_t$ at time $t$ to the next state $S_{t+1}$ as dictated by $T_k$. Suppose that
$S_t = i$ and $S_{t+1} = j$; then the teacher emits for $X_{t+1}$ the bit value that is appended
to $b_t$ in order to obtain $b_{t+1}$, i.e., the value $X_{t+1}$ satisfies

$$b_j = [b_i(2), b_i(3), \ldots, b_i(k), X_{t+1}]$$

where $b_j$ and $b_i$ are the binary vectors corresponding to the states $j$ and $i$,
respectively. Clearly, the structure of the Markov model allows only two outgoing
transitions from any given state since $X_{t+1}$ is binary; we call them type-1 and
type-0 transitions. Let us denote by $\mathcal{M}_k$ a Markov model (chain) based on
transition matrix $T_k$. We use $k^*$ to denote the order of the teacher's Markov chain (on
which $P$ is based). For any binary sequence $x^{(k+n)}$ of length at least $k > 0$ if we let
$b_k = [b_k(1), \ldots, b_k(k)] = [x_1, \ldots, x_k]$ and define recursively the value of

$$b_t = [b_{t-1}(2), \ldots, b_{t-1}(k), x_t] \qquad (3.1)$$

for all times $t > k$, where $x_t \in \{0, 1\}$, then the probability of $x^{(k+n)}$ with respect to
the teacher's model is defined by



$$P(x^{(k+n)}) = P(S_1 = b_k) \prod_{t=1}^{n} P\left(S_{t+1} = b_{t+k} \,\middle|\, S_t = b_{t-1+k}\right). \qquad (3.2)$$



Henceforth, all random binary sequences are assumed to be drawn according to
this probability distribution $P$ which is based on model $\mathcal{M}_{k^*}$. Neither the value $k^*$
nor the parameters of $\mathcal{M}_{k^*}$ are known to the learner. From basic theory on finite
Markov chains, since the matrix $T_{k^*}$ is stochastic (i.e., the sum of the elements in any
row equals 1) then $\mathcal{M}_{k^*}$ has a stationary probability distribution (not necessarily
unique), which we denote by $P^*$. Let us denote by

$$\beta = P^*(X_t = 1) \qquad (3.3)$$

at any time $t$.

We henceforth assume that the teacher reached stationarity and is producing
sequences (both training and testing) with respect to $P^*$ (for clearer notation we
will drop the star and simply write $P$).
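
As an illustration of the teacher described above, the following Python sketch draws a bit sequence from a Markov source whose state is the last $k$ emitted bits. It is a sketch under assumptions rather than the paper's procedure: the names `next_state` and `sample_teacher` are hypothetical, the source is parameterized only by the type-1 transition probability of each state, and the chain starts from a fixed state instead of from the stationary distribution.

```python
import random
from typing import List, Sequence


def next_state(state: int, bit: int, k: int) -> int:
    """Drop the oldest bit of a k-bit state and append `bit`, as in (3.1)."""
    return ((state << 1) & ((1 << k) - 1)) | bit


def sample_teacher(
    p_one: Sequence[float], k: int, n: int, rng: random.Random, start: int = 0
) -> List[int]:
    """Return k + n bits: the k bits of `start`, then n emitted bits.

    p_one[i] is the probability of a type-1 transition out of state i
    (a hypothetical parameterization of the transition matrix).
    """
    bits = [(start >> (k - 1 - j)) & 1 for j in range(k)]
    state = start
    for _ in range(n):
        bit = 1 if rng.random() < p_one[state] else 0
        bits.append(bit)
        state = next_state(state, bit, k)
    return bits


# Example: an order-2 source that tends to repeat its most recent bit.
rng = random.Random(0)
sequence = sample_teacher([0.1, 0.8, 0.2, 0.9], k=2, n=20, rng=rng)
print(sequence)
```

Because each state has exactly two successors, one probability per state is enough to describe the whole transition matrix.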

As a learner, we consider an algorithm that assumes a Markov model $\mathcal{M}_k$ of
dimension $k \ge 1$. The learner estimates the probability parameters

$$p_{ij} = P(S_{t+1} = j \mid S_t = i), \quad i, j \in \mathbb{S}_k$$

by

$$\hat{p}_{ij} = \frac{\#\{S_{t+1} = j, S_t = i\}}{\#\{S_t = i\}}$$

where $\#\{S_t = i\}$ denotes the number of times state $i$ appeared in the training
sequence $x^{(k+m)}$ which is drawn randomly by the teacher according to $P$ and so $\hat{p}_{ij}$
are the frequencies of transitions. The first $k$ bits of $x^{(k+m)}$ indicate the initial state
of the learner's model as it reads the training sequence; $m$ is the number of state
transitions taken by the teacher's model to generate the sequence. Note that $p_{ij}$
are unknowns since they represent the probability of transition from state $i$ to state
$j$ in the learner's model $\mathcal{M}_k$ given a random sequence generated according to the
teacher probability distribution $P$ (which is based on the unknown model $\mathcal{M}_{k^*}$).

After training, the learner is tested on a random test sequence $x^{(k+n)}$ obtained
from the teacher based on $\mathcal{M}_{k^*}$. The learner is repeatedly asked to predict the next
bit for each of the last $n$ random bits $X_{k+1}, \ldots, X_{k+n}$, where as above, the first $k$
bits of $x^{(k+n)}$ indicate the starting state of the learner's model as it reads the test
sequence. The learner computes the posterior probability $\hat{P}(X_{t+1} = 1 \mid S_t)$, based
on the learnt model $\mathcal{M}_k$, which is the probability that the next bit $X_{t+1} = 1$ given
that the current state is $S_t$ (at any given time $t$ the current state consists of the
last $k$ bits seen in the test sequence up to $t$). The learner's decision (prediction) is
based on the maximum a posteriori probability which is defined as follows: suppose
that the current state is $i$ then the decision is

$$d(i) = \begin{cases} 1 & \text{if } \hat{p}(1|i) > 1 - \hat{p}(1|i) \\ 0 & \text{otherwise,} \end{cases} \qquad (3.4)$$

$i \in \mathbb{S}_k$, where $\hat{p}(1|i)$ is defined as $\hat{p}_{ij}$ for the state $j$ whose $b_j = [b_i(2), \ldots, b_i(k), 1]$
(a type-1 transition from state $i$) and the corresponding true probability (measured
according to $P$) is denoted by $p(1|i) = p_{ij}$. Note that (3.4) may be expressed
alternatively as

$$d(i) = \begin{cases} 1 & \text{if } \hat{p}(1|i) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases} \qquad (3.5)$$
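
The estimation step and the decision rule can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the name `train_map_rule` is hypothetical, and the handling of states that never occur in the training sequence (their estimate is set to 0, so the rule predicts 0) is an assumption, since the text does not specify it.

```python
from typing import List, Sequence, Tuple


def train_map_rule(x: Sequence[int], k: int) -> Tuple[List[int], List[float], List[int]]:
    """Estimate type-1 transition frequencies from a training sequence of
    k + m bits and derive the majority decision d(i) = 1 iff the estimate
    exceeds 1/2.

    Returns (d, p_hat, visits), where visits[i] is the number of times a
    transition was taken out of state i.
    """
    num_states = 1 << k
    visits = [0] * num_states
    ones = [0] * num_states
    state = 0
    for b in x[:k]:  # the first k bits fix the initial state
        state = (state << 1) | b
    for bit in x[k:]:
        visits[state] += 1
        ones[state] += bit
        state = ((state << 1) & (num_states - 1)) | bit
    # Assumption: an unvisited state gets estimate 0.0 and hence d(i) = 0.
    p_hat = [ones[i] / visits[i] if visits[i] else 0.0 for i in range(num_states)]
    d = [1 if p > 0.5 else 0 for p in p_hat]
    return d, p_hat, visits


d, p_hat, visits = train_map_rule([0, 1, 0, 1, 0, 1, 0, 1], k=1)
print(d, p_hat, visits)
```

In the alternating example every 0 is followed by a 1 and every 1 by a 0, so the learnt rule predicts 1 in state 0 and 0 in state 1, and the visit counts sum to the number of transitions.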

Denote by $m_i \ge 0$ the number of times state $i$ was entered as the teacher scans the
training sequence $x^{(k+m)}$ from $t = k + 1$ up to $t = k + m$ (as mentioned above,
the initial state at time $t = k + 1$ is the state whose $b = [x_1, x_2, \ldots, x_k]$). We
will sometimes refer to $m_i$ as the $i$-th subsample size. Note that $m_i$, $i \in \mathbb{S}_k$, are
dependent random variables since the Markov chain may visit each state a random
number of times and they all must satisfy $\sum_{i=0}^{2^k - 1} m_i = m$. We claim that the
$\hat{p}(1|i)$, $i \in \mathbb{S}_k$, are independent random variables when conditioned on the vector
$[m_0, \ldots, m_{2^k - 1}]$ (which we henceforth denote by $\mathbf{m}$). To see this, consider a training
sequence $x^{(m+k)}$ generated by the teacher according to (3.2). Let us denote the
corresponding sequence of states by $\sigma^{(m)} = (\sigma_1, \ldots, \sigma_m)$ with $\sigma_i \in \mathbb{S}_k$. To show the
dependence of $x^{(m+k)}$ on $\sigma^{(m)}$ we will sometimes write $x^{(m+k)} = x(\sigma^{(m)})$. Then
by (3.2) we have

$$P\left(x(\sigma^{(m)})\right) = P(S_1 = \sigma_1)\, P(S_2 = \sigma_2 \mid S_1 = \sigma_1)\, P(S_3 = \sigma_3 \mid S_2 = \sigma_2) \cdots P(S_m = \sigma_m \mid S_{m-1} = \sigma_{m-1}).$$

Due to the structure of the Markov chain not every sequence $\sigma^{(m)} \in \mathbb{S}_k^m$ is possible.
Denote by $V \subset \mathbb{S}_k^m$ the set of valid state sequences $\sigma^{(m)}$ (possible under $P$). We
now show that given that the state sequence $\sigma^{(m)}$ corresponding to the data $x^{(m+k)}$
is in $V$ and conditioned on $\mathbf{m}$ then any other state sequence that visits the same
states the same number of times as $\sigma^{(m)}$ (perhaps in a different order) must have
the same probability according to $P$. For any $i \in \mathbb{S}_k$ denote by $N_\sigma(1|i)$ the number
of type-1 transitions from state $i$ in the sequence $\sigma$. Without loss of generality, let
us assume that always initially the state is $i = 0$ so that we can write for the first
factor $P(S_1 = \sigma_1) = 1$. Since all state transitions are either type-0 or type-1 then
we have

$$P\left(x^{(m+k)} \,\middle|\, \mathbf{m}, \sigma^{(m)} \in V\right) = \prod_{i \in \mathbb{S}_k} p(1|i)^{N_\sigma(1|i)} \left(1 - p(1|i)\right)^{m_i - N_\sigma(1|i)} \qquad (3.6)$$



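where $p(1|i)$ was defined above. Let $a$ be a non-negative integer constant and define
the vector function $N(i) = [N(1|i), a - N(1|i)]$. Associate a conditional probability
function for the random variable $N(i)$ as

$$P\left(N(i) = [\ell, a - \ell] \,\middle|\, a\right) = p(1|i)^{\ell} \left(1 - p(1|i)\right)^{a - \ell}.$$

Then (3.6) may be written as

$$P\left(x^{(m+k)} \,\middle|\, \mathbf{m}, \sigma^{(m)} \in V\right) = \prod_{i \in \mathbb{S}_k} P\left(N(i) = [N_\sigma(1|i), m_i - N_\sigma(1|i)] \,\middle|\, m_i\right). \qquad (3.7)$$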



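For a fixed value of $m_i$ the event "$N(i) = [N_\sigma(1|i), m_i - N_\sigma(1|i)]$" is equivalent to
the event "$\hat{p}(1|i) = N_\sigma(1|i)/m_i$". Hence alternatively (3.7) can be expressed as

$$P\left(x^{(m+k)} \,\middle|\, \mathbf{m}, \sigma^{(m)} \in V\right) = \prod_{i \in \mathbb{S}_k} P\left(\hat{p}(1|i) = \frac{N_\sigma(1|i)}{m_i} \,\middle|\, m_i\right). \qquad (3.8)$$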



The right side of (3.8) is a product of probability functions of the random
variables $\hat{p}(1|i)$. So conditioned on $\mathbf{m}$ and on the event that $x^{(m+k)}$ corresponds to a
valid state sequence $\sigma^{(m)}$, the event that $x^{(m+k)}$ is generated by the teacher
is equivalent to the event that $\sigma^{(m)}$ has transition frequencies $\hat{p}(1|i)$ that
independently take the particular values $N_\sigma(1|i)/m_i$ according to $x^{(m+k)}$. It also follows
that $\hat{p}(1|i)$ is the average of independent Bernoulli trials (success taken as a type-1
transition from state $i$), hence the count $m_i \hat{p}(1|i)$ is distributed according to the
Binomial distribution with parameters $m_i$ and $p(1|i)$.

Let us state the main result of this paper. 

Theorem 1. Let $0 < \delta < 1$ and $k, \ell, m, n$ be positive integers. Let $P$ be the
stationary probability distribution based on a finite, ergodic and reversible Markov
chain with probability-transition matrix $T$ that has a second largest eigenvalue $\lambda$.
Denote by $\gamma = (1 - \max\{0, \lambda\})/(1 + \max\{0, \lambda\})$. Suppose after reaching
stationarity the teacher generates a binary sequence $X^{(n)} = X_1, X_2, \ldots, X_n$ by repeatedly
drawing $X_t$ according to $P$ and denote by $\beta = P(X_t = 1)$. Let $x^{(k+m)}$ be a given
randomly drawn training sequence according to $P$. Suppose that a learner uses
$x^{(k+m)}$ to estimate the values of the parameters of a Markov model $\mathcal{M}_k$ with states
$i \in \mathbb{S}_k = \{0, \ldots, 2^k - 1\}$. Denote by $p(1|i)$ the probability of making a
transition from state $i$ to a state whose binary vector is $[b_i(2), \ldots, b_i(k), 1]$ (according to
$T$). Let $p_i$ denote the probability that a Binomial random variable with parameters
$m_i$ and $p(1|i)$ is larger than $m_i/2$ and denote by $p^{(m)}$ the average $\frac{1}{2^k} \sum_{i \in \mathbb{S}_k} p_i$ and
$m = \sum_{i \in \mathbb{S}_k} m_i$. Suppose that the learner is tested incrementally on a randomly
drawn sequence $x^{(k+n)}$ generated according to $P$. The learner predicts an output
bit $y_t$ for every input bit $x_t$ in $x^{(k+n)}$ using $\mathcal{M}_k$ and the maximum a posteriori
decision. Denote by $\xi^{(n)}$ the sequence of mistakes where $\xi_t = 1$ if $y_t \ne x_t$, and
$\xi_t = 0$ otherwise, $k + 1 \le t \le k + n$. Suppose that the subsequence $\xi^{(\nu)}$ of mistakes
corresponding to 0-valued predictions is of length $\nu \ge \ell$. Denote by $\epsilon(\ell, k, p^{(m)}, \delta)$
the following expression

Then for any $0 < \delta < 1$, with confidence at least $1 - \delta$ the deviation between $\beta$ and
the frequency of 1s of $\xi^{(\nu)}$ is bounded as

$$\left| \frac{1}{\nu} \sum_{j=1}^{\nu} \xi_j - \beta \right| \le \epsilon(\ell, k, p^{(m)}, \delta)$$

where it is assumed that $(1 + \epsilon(\ell, k, p^{(m)}, \delta))\, p^{(m)} < 1/2$.

Let us interpret this theorem in the context of the discussion in the earlier sections.
We see that the quantity $p^{(m)} 2^k$ in the bound plays a role of effective complexity of
the learner's decision rule. Let us assume that it is not too low since otherwise the
learner's decision rule may be atypical (the number of 1s in the vector $d$ will deviate
largely from the mean $2^k p^{(m)}$), which is an event of exponentially-small probability
(with respect to the learner's model order). Now, under this assumption it is the
first expression in the max{} which is important. We see that the larger $p^{(m)} 2^k$
the more that the mistake sequence $\xi^{(\nu)}$ can deviate in randomness. The theorem
therefore implies that a Markov learner who trains on a random binary sequence
(generated by a fixed Markov source) will with probability $1 - \delta$ learn a decision
rule that makes a sequence of mistakes $\xi^{(\nu)}$, $\nu \ge \ell$, that may deviate from being a
purely random sequence (according to $P$) by an amount no larger than $\epsilon(\ell, k, p^{(m)}, \delta)$.
In other words, its frequency of 1s may deviate from the true probability of a 1



by an amount that increases like O y\J ^-j^ J . Another implication of theorem is 

that the only explicit dependence on the source comes through $\gamma$ and $p^{(m)}$ via the
distribution $P$. The value of $\gamma$ is bounded by 0 and 1 where it is 1 if the Markov
chain consists of independent states and gets closer to zero as the states become
more mutually dependent. Thus from the bound we see that the more dependent
the sequence generated by the source the larger the possible mistake-sequence's
deviation in randomness. There is no explicit dependence on $k^*$; however, implicitly,
the bound does depend on $k^*$ since when the learner's model-order $k$ is much
smaller than the source's order $k^*$ then the 'memory' of the source is larger than
the learner's window size (recall that $k$ represents the number of bits per state).
As the learner scans the training sequence using a small window (compared to
the source's memory) it estimates the state-transition probabilities $p_{ij}$ and obtains
$\hat{p}(1|i)$ which on average are close to 1/2 (and so will the $p_i$). Therefore $p^{(m)}$ will be
relatively large (recall that we assume that it is less than $\frac{1}{2(1 + \epsilon)}$).
However if $k > k^*$ then the learner's window is close to the source's memory
length so in general the $p_{ij}$ may deviate considerably from 1/2. In this case the
$p_i$ can be close to zero and hence make $p^{(m)}$ much smaller than 1/2. Hence when
the learner's model order (compared to the source's order) is small there may be a
large deviation of randomness for the mistake sequence. Once the learner's order
matches that of the teacher's then the effective complexity may be significantly
lower hence making the mistake-sequence's deviation in randomness smaller.

4. Proof of Theorem 1

We assume that the length $m$ of the training sequence $x^{(k+m)}$ (which is drawn
randomly according to $P$) is fixed. Referring to (3.5), the decision rule of the learner
is denoted by a binary vector

$$d = [d(0), \ldots, d(2^k - 1)]. \qquad (4.1)$$

Note that $d$ fully describes the learner's prediction rule associated with model $\mathcal{M}_k$
(which is learnt based on $x^{(k+m)}$). It describes every possible prediction that can
be made for all possible situations (present states). From the previous section, the
elements $d(i)$ depend on the random variables $\hat{p}(1|i)$ which when conditioned on $\mathbf{m}$
are independent and Binomially distributed with parameters $m_i$ and $p(1|i)$, $i \in \mathbb{S}_k$,
respectively. We have from (3.5) that

$$p_i = P(d(i) = 1) = P\left(\hat{p}(1|i) > \frac{1}{2}\right) \qquad (4.2)$$

which equals the probability that a Binomial random variable (with parameters
$m_i$, $p(1|i)$) is larger than $m_i/2$. From (4.2) it follows that the elements of $d$ are
non-identically distributed Bernoulli random variables with parameters $p_i$.
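
For concreteness, the parameters $p_i$ and their average $p^{(m)}$ can be computed exactly from the Binomial distribution. The Python sketch below does this with the standard library only; the function names and the example values of $m_i$ and $p(1|i)$ are hypothetical, chosen only to show the computation.

```python
from math import comb
from typing import Sequence


def prob_majority_ones(m_i: int, p: float) -> float:
    """P(Binomial(m_i, p) > m_i / 2): the chance that d(i) = 1."""
    return sum(
        comb(m_i, j) * p**j * (1 - p) ** (m_i - j)
        for j in range(m_i + 1)
        if 2 * j > m_i
    )


def average_p(m_counts: Sequence[int], p_true: Sequence[float]) -> float:
    """Average of the p_i over all states, i.e. p^(m) for the given counts."""
    probs = [prob_majority_ones(m_i, p) for m_i, p in zip(m_counts, p_true)]
    return sum(probs) / len(probs)


# Hypothetical order-1 example: two states with 30 and 10 visits.
p_bar = average_p([30, 10], [0.3, 0.6])
print(p_bar, 2 * p_bar)  # p^(m) and the expected number of 1s in d
```

A state with no visits contributes $p_i = 0$ here because the inequality is strict, which matches predicting 0 whenever the estimate does not exceed 1/2.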

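For a binary vector $d$ denote by $\|d\|$ the $l_1$-norm (or Hamming weight) of $d$. By
the statement of the theorem, letting $m = \sum_{i \in \mathbb{S}_k} m_i$ we have
$p^{(m)} = \frac{1}{2^k} \sum_{i \in \mathbb{S}_k} p_i$, then the expected number of 1s in $d$ is

$$E[\|d\|] = E\left[\sum_{i=0}^{2^k - 1} d(i)\right] = \sum_{i=0}^{2^k - 1} p_i = 2^k p^{(m)}. \qquad (4.3)$$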

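Let us define the following set,

$$A_{\epsilon, \mathbf{m}} = \left\{ d \in \{0, 1\}^{2^k} : 1 - \epsilon \le \frac{\|d\|}{2^k p^{(m)}} \le 1 + \epsilon \right\} \qquad (4.4)$$

which depends on the subsample size vector $\mathbf{m} = [m_0, \ldots, m_{2^k - 1}]$ through $p^{(m)}$.
Under $P$ the probability of the event of not falling in $A_{\epsilon, \mathbf{m}}$ is the same as the
probability of this event conditioned on the state sequence (corresponding to $x^{(k+m)}$)
being valid, hence we have

$$P\left(d \notin A_{\epsilon, \mathbf{m}}\right) = P\left(d \notin A_{\epsilon, \mathbf{m}} \,\middle|\, \sigma^{(m)} \in V\right) = \sum_{\mathbf{m}} P\left(d \notin A_{\epsilon, \mathbf{m}} \,\middle|\, \mathbf{m}, \sigma^{(m)} \in V\right) P\left(\mathbf{m} \,\middle|\, \sigma^{(m)} \in V\right) \qquad (4.5)$$

where the sum ranges over all non-negative $\mathbf{m}$ that satisfy $\sum_{i} m_i = m$. We now
bound the first factor inside the sum by a quantity which only depends on $m$ (not
on the specific vector $\mathbf{m}$). We have,

$$P\left(d \notin A_{\epsilon, \mathbf{m}} \,\middle|\, \mathbf{m}, \sigma^{(m)} \in V\right) = P\left(\{d : \|d\| > (1 + \epsilon) 2^k p^{(m)}\} \cup \{d : \|d\| < (1 - \epsilon) 2^k p^{(m)}\} \,\middle|\, \mathbf{m}, \sigma^{(m)} \in V\right)$$
$$\le P\left(\|d\| > (1 + \epsilon) 2^k p^{(m)} \,\middle|\, \mathbf{m}, \sigma^{(m)} \in V\right) + P\left(\|d\| < (1 - \epsilon) 2^k p^{(m)} \,\middle|\, \mathbf{m}, \sigma^{(m)} \in V\right). \qquad (4.6)$$

As stated above, conditioned on $\mathbf{m}$ the $d(i)$ are independent thus $\|d\|$ is a sum of
independent non-identically distributed Bernoulli random variables (also known as
Poisson trials). We will use the following lemma.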

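Lemma 2. Let $X_1, \ldots, X_n$ be independent Bernoulli random variables with $P(X_i =
1) = p_i$ where $0 < p_i < 1$ and denote by $p = \frac{1}{n} \sum_{i=1}^{n} p_i$. Then for any $0 < \epsilon \le 1$ we
have

$$P\left(\sum_{i=1}^{n} X_i < (1 - \epsilon) n p\right) \le e^{-n p \epsilon^2 / 2}$$

and

$$P\left(\sum_{i=1}^{n} X_i > (1 + \epsilon) n p\right) \le e^{-n p \epsilon^2 / 4}.$$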

The slight asymmetry in the bounds can be seen from the proof of the lemma which 
is based on applying Chernoff bound on the tail probability of the sum of Poisson 
trials (see Theorem 4.1 and 4.3, in 
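
As a numerical sanity check of Chernoff-type bounds for Poisson trials, the following Python sketch computes the exact distribution of a sum of independent, non-identically distributed Bernoulli variables by dynamic programming and compares both tails with exponential bounds. The constants 4 and 2 in the exponents are assumptions taken from the standard Chernoff bounds for Poisson trials, and the function names and the example probabilities are hypothetical.

```python
from math import exp
from typing import List, Sequence, Tuple


def poisson_binomial_pmf(ps: Sequence[float]) -> List[float]:
    """Exact pmf of a sum of independent Bernoulli(p_i) variables."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for s, w in enumerate(pmf):
            nxt[s] += w * (1 - p)   # this trial fails
            nxt[s + 1] += w * p     # this trial succeeds
        pmf = nxt
    return pmf


def tails_and_bounds(ps: Sequence[float], eps: float) -> Tuple[float, float, float, float]:
    """Return (upper tail, lower tail, upper bound, lower bound)."""
    mu = sum(ps)  # equals n * p with p the average of the p_i
    pmf = poisson_binomial_pmf(ps)
    upper = sum(w for s, w in enumerate(pmf) if s > (1 + eps) * mu)
    lower = sum(w for s, w in enumerate(pmf) if s < (1 - eps) * mu)
    # Assumed constants: 4 for the upper tail, 2 for the lower tail.
    return upper, lower, exp(-mu * eps**2 / 4), exp(-mu * eps**2 / 2)


ps = [0.1, 0.3, 0.5, 0.2] * 5
upper, lower, upper_bound, lower_bound = tails_and_bounds(ps, eps=0.5)
print(upper, upper_bound)
print(lower, lower_bound)
```

In the proof the trials are the decision bits $d(i)$, so `ps` plays the role of the $p_i$ and `mu` the role of $2^k p^{(m)}$.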

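By the above lemma and from (4.4), (4.5), (4.6) it follows that the probability
that a random $d$ does not fall in $A_{\epsilon, \mathbf{m}}$ is bounded as

(4.7)

Next, we estimate the cardinality of the set $A_{\epsilon, \mathbf{m}}$. From (4.4) we have,

$$\left|A_{\epsilon, \mathbf{m}}\right| \le \sum_{i = \lfloor (1 - \epsilon) 2^k p^{(m)} \rfloor}^{\lceil (1 + \epsilon) 2^k p^{(m)} \rceil} \binom{2^k}{i}.$$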

Denote by $B(k, n) = \binom{n}{k}$, then it is easy to verify (see for instance @) that the
ratio

$$\phi(k) = \frac{B(k, n)}{B(k - 1, n)} = \frac{n - k + 1}{k}$$

decreases monotonically as $k$ increases from 1 to $n$. For $k > n/2$ this ratio is smaller
than 1 hence it follows that for any $v \ge 1$

$$\frac{B(k + v, n)}{B(k, n)} = \phi(k + 1)\phi(k + 2) \cdots \phi(k + v) \le \phi(k + 1)^v.$$

It follows that for any $c > n/2$ if we denote by $a_{c+1} = \phi(c + 1) = \frac{n - c}{c + 1}$ then the
following upper bound holds,

$$\sum_{v = 0}^{n - c} \binom{n}{c + v} \le \binom{n}{c} \frac{1}{1 - a_{c+1}}.$$

Similarly, the inverse $\phi^{-1}(k)$ increases monotonically as $k$ increases and for $k < n/2$
is smaller than 1 hence it follows that for any $c < n/2$ we have

$$\sum_{v = 0}^{c} \binom{n}{c - v} \le \binom{n}{c} \frac{1}{1 - \phi^{-1}(c)}.$$
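
The monotonicity of the binomial ratio and the resulting tail bound are easy to check numerically. The Python sketch below is a hedged illustration: `phi` follows the formula in the text, while `tail_sum_and_bound` assumes the geometric-series form of the upper bound, $\binom{n}{c}/(1 - a_{c+1})$, which is what the monotonicity argument yields for $c > n/2$.

```python
from math import comb, isclose
from typing import Tuple


def phi(k: int, n: int) -> float:
    """Ratio of consecutive binomial coefficients B(k, n) / B(k - 1, n)."""
    return (n - k + 1) / k


def tail_sum_and_bound(c: int, n: int) -> Tuple[int, float]:
    """Exact sum of C(n, c + v) for v >= 0 and its geometric-series bound.

    Assumes c > n / 2 so that a = phi(c + 1) is smaller than 1.
    """
    a = phi(c + 1, n)
    exact = sum(comb(n, c + v) for v in range(n - c + 1))
    bound = comb(n, c) / (1 - a)
    return exact, bound


n = 10
ratios = [phi(k, n) for k in range(1, n + 1)]
print(ratios)
print(tail_sum_and_bound(6, n))
```

For $n = 10$ and $c = 6$ the exact tail is $210 + 120 + 45 + 10 + 1 = 386$, while the bound evaluates to $210 / (1 - 4/7) = 490$.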



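Therefore, as a bound on the cardinality of $A_{\epsilon, \mathbf{m}}$ we have

$$\left|A_{\epsilon, \mathbf{m}}\right| \le \begin{cases} \binom{2^k}{\lfloor (1 - \epsilon) 2^k p^{(m)} \rfloor} \frac{1}{1 - a} & \text{if } \epsilon < 1 - \frac{1}{2 p^{(m)}} \\ \binom{2^k}{\lceil (1 + \epsilon) 2^k p^{(m)} \rceil} \frac{1}{1 - a'} & \text{if } \epsilon < \frac{1}{2 p^{(m)}} - 1 \end{cases} \qquad (4.8)$$

where

$$a = \frac{2^k - \lfloor (1 - \epsilon) 2^k p^{(m)} \rfloor}{\lfloor (1 - \epsilon) 2^k p^{(m)} \rfloor + 1}$$

and

$$a' = \frac{\lceil (1 + \epsilon) 2^k p^{(m)} \rceil}{2^k - \lceil (1 + \epsilon) 2^k p^{(m)} \rceil + 1}.$$

By the assumption of the theorem, $p_i$ (defined in (4.2)) have an average value (4.3)
that satisfies $(1 + \epsilon) p^{(m)} < \frac{1}{2}$ so the bottom bound in (4.8) applies. Hence the
bound simplifies to

$$\left|A_{\epsilon, \mathbf{m}}\right| \le \binom{2^k}{(1 + \epsilon) 2^k p^{(m)}} \frac{1}{1 - a'} \qquad (4.9)$$

where we henceforth drop the $\lceil \cdot \rceil$ from the lower entry and leave it implicit.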

We now continue the analysis in order to obtain a bound on the possible deviation
in randomness of the learner's mistake sequence. Let us denote by $R_d : \{0, 1\}^* \to
\{0, 1\}$ the learner's decision rule which is defined based on the model $\mathcal{M}_k$ learnt by
the learner where $d$ is defined in (4.1). When given a finite random binary sequence
$X^{(t)}$, $R_d$ produces a binary prediction at time $t + 1$, referred to as an output bit,
which equals

$$Y_{t+1} = R_d(X^{(t)}) = d(S_t)$$

where $S_t$ is the state of the learner at time $t$ and $d(S_t)$ is as defined in (3.4). Let us
denote by $\xi^{(n)}$ the sequence of errors where the error bit $\xi_t$ at time $t$ equals 1 or 0
according to whether the event of an error in prediction occurs or not, respectively,
that is, for a given input sequence $x^{(n)}$ and a prediction sequence $y^{(n)}$ we define

$$\xi_t = \begin{cases} 1 & \text{if } y_t \ne x_t \\ 0 & \text{otherwise.} \end{cases}$$

Consider the subsequence $\xi^{(\nu)}$ of $\xi^{(n)}$ corresponding to the errors associated with
the prediction of a 0, i.e., $\xi^{(\nu)}$ consists of the bits of $\xi^{(n)}$ at times $t$ when the
prediction bit $y_t = 0$. Clearly, $\xi^{(\nu)}$ is a subsequence of the input $x^{(n)}$ since when the
prediction is 0 the error bit equals the input bit. The length $\nu$ of this subsequence
is a random variable since it depends on the learnt model $\mathcal{M}_k$. Since $\xi^{(\nu)}$ is a
subsequence of the error resulting from prediction by $R_d$ and is also a subsequence
of the input $x^{(n)}$ we associate a selection rule $\Gamma_d$ (see section 2) with the decision
rule $R_d$ and say that $\Gamma_d$ selects $\xi^{(\nu)}$ from $x^{(n)}$.
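
The construction of the mistake sequence and of its subsequence on 0-valued predictions can be sketched in Python as follows. The function name `mistakes_on_zero_predictions` and the small example are hypothetical; the decision vector `d` is taken as given, for instance as produced by a majority rule on estimated transition frequencies.

```python
from typing import List, Sequence, Tuple


def mistakes_on_zero_predictions(
    x: Sequence[int], d: Sequence[int], k: int
) -> Tuple[List[int], List[int]]:
    """Return (xi, xi_zero) for a test sequence x of k + n bits.

    xi is the full mistake sequence; xi_zero keeps only the error bits at
    times when the prediction was 0, where the error bit equals the input bit.
    """
    mask = (1 << k) - 1
    state = 0
    for b in x[:k]:  # the first k bits fix the starting state
        state = (state << 1) | b
    xi: List[int] = []
    xi_zero: List[int] = []
    for bit in x[k:]:
        y = d[state]                      # prediction made from the current state
        err = 1 if y != bit else 0
        xi.append(err)
        if y == 0:
            xi_zero.append(err)
        state = ((state << 1) & mask) | bit
    return xi, xi_zero


xi, xi_zero = mistakes_on_zero_predictions([0, 1, 0, 0, 1, 1], d=[1, 0], k=1)
frequency_of_ones = sum(xi_zero) / len(xi_zero)
print(xi, xi_zero, frequency_of_ones)
```

The quantity `frequency_of_ones` is the frequency whose deviation from $\beta$ the theorem bounds.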

Let $E_d^{(\ell, \epsilon)}$ denote the event that based on a given fixed rule $\Gamma_d$ the selected
subsequence $\xi^{(\nu)}$ from a random input sequence $x^{(n)}$ is of length at least $\ell$ and its
frequency of 1s deviates from the expected value by at least $\epsilon$, formally,

$$E_d^{(\ell, \epsilon)} = \left\{ x^{(n)} : \xi^{(\nu)} = \Gamma_d(x^{(n)}),\ \nu \ge \ell,\ \left| \frac{\|\xi^{(\nu)}\|}{\nu} - \beta \right| > \epsilon \right\}$$

where $\|\xi^{(\nu)}\|$ denotes the number of 1s in the binary sequence $\xi^{(\nu)}$ of length $\nu$.
We use the following lemma which states a rate on the strong law of large
numbers for a Markov chain.

Lemma 3. Let $X_1, \ldots, X_n$ be a finite ergodic and reversible Markov chain in
stationary state with a second largest eigenvalue $\lambda$ and $f$ a function taking values
in $[0, 1]$ such that $E f(X_i) = \mu$. Denote by $\lambda_0 = \max\{0, \lambda\}$ and the stationary
probability distribution by $P^*$. Then for all $\epsilon > 0$ such that $\mu + \epsilon < 1$ and all time $n$
the following bound holds:

$$\exp\{-n t (\mu + \epsilon)\}\, E \exp\left\{ t \sum_{i=1}^{n} f(X_i) \right\} \le \exp\left\{ -2\, \frac{1 - \lambda_0}{1 + \lambda_0}\, n \epsilon^2 \right\}$$

where the expectation is taken with respect to $P^*$.

The lemma follows from the proof of Theorem 1 of

We now apply this lemma. Consider the state subsequence $S^{(\nu)}$ corresponding to
the subsequence $\xi^{(\nu)}$ and for any state $s \in \mathbb{S}_k$ let the function $f(s)$ in the lemma take
as value the least significant bit of the binary representation of $s$, so that for the
state at time $i$ we have $f(S_i) = \xi_i^{(\nu)}$. Thus in our case $\mu$ in the lemma equals
$P(\xi_i^{(\nu)} = 1) = \beta$ since the stationary distribution is invariant for all times and in
particular for those times where the prediction is 0. Also, $\lambda$ is the second largest
eigenvalue of the transition matrix $T$. By the assumption of Theorem 1 the Markov chain
of the teacher satisfies the conditions of the lemma. Now, the random subsequence $S^{(\nu)}$
may consist of a mix of random variables $S_i$ that are mutually independent (if their
indices are not consecutive) and dependent (otherwise). Suppose that we have a
sequence $S^{(\nu,q)} = \{S_i\}_{i=1}^{\nu}$ with $q$ independent states and denote the corresponding
mistake sequence by $\xi^{(\nu,q)}$. Denote by $S^{(q)}$ and $S^{(\nu-q)}$ the parts of $S^{(\nu,q)}$ that
are independent and dependent, respectively. That is, $S^{(q)} = \{S_{i_1}, S_{i_2}, \ldots, S_{i_q}\}$
are states visited by the subsequence $S^{(\nu)}$ where for any pair $S_{i_j}, S_{i_k} \in S^{(q)}$ the
indices are not consecutive, $|i_j - i_k| > 1$, hence $S^{(q)}$ is a sequence of independent
random states. The other sequence $S^{(\nu-q)} = \{S^{(r_1)}, S^{(r_2)}, \ldots, S^{(r_N)}\}$ consists only
of consecutive subsequences (referred to as parts) $S^{(r_j)} = \{S_{i_j}, S_{i_j+1}, \ldots, S_{i_j+r_j}\}$,
$j = 1, \ldots, N$, whose lengths sum to $\nu - q$, where each part consists of consecutive
states, hence pairwise-dependent. Denote by $I_q$, $I_{\nu-q}$, $I_{r_j}$ the sets of time-indices (with
respect to the sequence $S^{(n)}$) of states in $S^{(q)}$, $S^{(\nu-q)}$ and $S^{(r_j)}$, respectively. We
now apply Markov's inequality to the subsequence $\xi^{(\nu,q)}$ and
obtain

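$$P\Big( \frac{1}{\nu}\sum_{i=1}^{\nu} \xi_i^{(\nu,q)} - \beta > \epsilon \Big) = P\Big( \frac{1}{\nu}\sum_{i=1}^{\nu} f(S_i) > \beta + \epsilon \Big) \le \exp\{-\nu t(\beta + \epsilon)\}\, \mathbb{E} \exp\Big\{ t \sum_{i=1}^{\nu} f(S_i) \Big\}. \tag{4.10}$$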
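The split of the selected time indices into isolated indices (the set written $I_q$ above) and runs of consecutive indices (the parts with index sets $I_{r_j}$) can be sketched as follows. The function name `split_indices` and the convention that a part is a maximal run of at least two consecutive indices are assumptions made for this illustration.

```python
def split_indices(indices):
    """Split sorted, distinct time indices into
    (isolated indices, maximal runs of >= 2 consecutive indices)."""
    isolated, runs, current = [], [], []

    def flush(block):
        # A block of length 1 is an isolated index; longer blocks are parts.
        if len(block) == 1:
            isolated.append(block[0])
        elif len(block) > 1:
            runs.append(block)

    for i in indices:
        if current and i == current[-1] + 1:
            current.append(i)      # extends the current run of consecutive indices
        else:
            flush(current)         # close the previous block
            current = [i]
    flush(current)
    return isolated, runs

# Hypothetical times at which the learner predicted 0.
isolated, runs = split_indices([1, 3, 4, 5, 8, 10, 11])
print(isolated)  # [1, 8]
print(runs)      # [[3, 4, 5], [10, 11]]
```

Here $q$ is `len(isolated)` and $\nu - q$ is the total number of indices in `runs`.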

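Due to independence of some of the parts of $S^{(\nu,q)}$ we may split the expectation as
follows:

$$\mathbb{E} \exp\Big\{ t \sum_{i=1}^{\nu} f(S_i) \Big\} = \mathbb{E} \exp\Big\{ t \sum_{i \in I_q} f(S_i) \Big\} \prod_{j=1}^{N} \mathbb{E} \exp\Big\{ t \sum_{i \in I_{r_j}} f(S_i) \Big\}. \tag{4.11}$$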



An independent sequence of random variables can be represented as a Markov chain
with a row-stochastic transition matrix $T$ all of whose rows are identical. By the
Perron-Frobenius theorem such a matrix has a maximum eigenvalue of 1 and from a
standard bound on the absolute value of any other eigenvalue (based on Gerschgorin
disks) it follows that the second largest eigenvalue equals 0. Thus from Lemma 3
we obtain

$$\exp\{-qt(\beta + \epsilon)\}\, \mathbb{E} \exp\Big\{ t \sum_{i \in I_q} f(S_i) \Big\} \le \exp\{-2q\epsilon^2\}.$$
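The claim that a row-stochastic matrix with identical rows has second largest eigenvalue 0 can be checked with a short sketch. Note that it uses a different argument from the Gerschgorin-disk bound mentioned in the text: such a matrix satisfies $T^2 = T$, so every eigenvalue is 0 or 1, and its trace equals 1, so the eigenvalue 1 appears exactly once. The distribution `p` is a hypothetical example.

```python
def matmul(A, B):
    """Plain-Python product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identical_row_matrix(p):
    """Transition matrix of an i.i.d. sequence: every row is the distribution p."""
    return [list(p) for _ in p]

p = [0.5, 0.25, 0.25]          # hypothetical distribution over three states
T = identical_row_matrix(p)
T2 = matmul(T, T)

# T is idempotent, so its eigenvalues lie in {0, 1}.
idempotent = all(abs(T2[i][j] - T[i][j]) < 1e-12
                 for i in range(len(p)) for j in range(len(p)))
# The trace is the sum of the eigenvalues, so exactly one eigenvalue equals 1.
trace = sum(T[i][i] for i in range(len(p)))
print(idempotent, trace)
```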
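Also from the lemma, we get for the inner part of the product in (4.11)

$$\exp\{-t(\nu - q)(\beta + \epsilon)\} \prod_{j=1}^{N} \mathbb{E} \exp\Big\{ t \sum_{i \in I_{r_j}} f(S_i) \Big\} \le \exp\Big\{ -2(\nu - q)\epsilon^2\, \frac{1 - \lambda_0}{1 + \lambda_0} \Big\}$$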

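where again $\lambda_0 = \max\{0, \lambda\}$. Let $\gamma = (1 - \lambda_0)/(1 + \lambda_0)$; then, continuing to bound
the right hand side of (4.10), we get the following large deviation bound

$$P\Big( \Big| \frac{1}{\nu}\sum_{i=1}^{\nu} \xi_i^{(\nu,q)} - \beta \Big| > \epsilon \Big) \le 2\exp\{-2q\epsilon^2\} \exp\{-2(\nu - q)\gamma\epsilon^2\} = 2\exp\{-2q(1 - \gamma)\epsilon^2\} \exp\{-2\nu\gamma\epsilon^2\}.$$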
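The regrouping of exponents in the last equality can be verified numerically. The sketch below evaluates both forms for a few hypothetical values of $\nu$, $q$, $\epsilon$ and $\lambda_0$ (none taken from the paper) and also checks that, since $0 < \gamma \le 1$, both are at most $2\exp\{-2\nu\gamma\epsilon^2\}$.

```python
import math

def gamma_of(lam0):
    """gamma = (1 - lambda_0) / (1 + lambda_0), with lambda_0 in [0, 1)."""
    return (1.0 - lam0) / (1.0 + lam0)

def bound_split(nu, q, eps, gamma):
    """2 exp{-2 q eps^2} exp{-2 (nu - q) gamma eps^2}."""
    return 2.0 * math.exp(-2.0 * q * eps ** 2) * math.exp(-2.0 * (nu - q) * gamma * eps ** 2)

def bound_regrouped(nu, q, eps, gamma):
    """2 exp{-2 q (1 - gamma) eps^2} exp{-2 nu gamma eps^2}."""
    return 2.0 * math.exp(-2.0 * q * (1.0 - gamma) * eps ** 2) * math.exp(-2.0 * nu * gamma * eps ** 2)

# Hypothetical values used only to exercise the identity.
cases = [(100, 0, 0.1, 0.5), (100, 37, 0.1, 0.5), (250, 250, 0.05, 0.9), (40, 10, 0.3, 0.0)]
for nu, q, eps, lam0 in cases:
    g = gamma_of(lam0)
    left, right = bound_split(nu, q, eps, g), bound_regrouped(nu, q, eps, g)
    worst = 2.0 * math.exp(-2.0 * nu * g * eps ** 2)   # value obtained when q = 0
    print(nu, q, left, right, worst)
```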
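It follows that for any fixed $d \in \{0,1\}^{2^k}$ we have

$$P\big(E_d^{(\ell)}\big) = \sum_{\nu \ge \ell} \sum_{q=0}^{\nu} P\Big( \Big| \frac{1}{\nu}\sum_{i=1}^{\nu} \xi_i^{(\nu,q)} - \beta \Big| > \epsilon \,\Big|\, \nu, q \Big) P(q \mid \nu) P(\nu) \le 2 \sum_{\nu \ge \ell} \exp\{-2\nu\gamma\epsilon^2\} \Big[ \sum_{q=0}^{\nu} \exp\{-2q(1 - \gamma)\epsilon^2\} P(q \mid \nu) \Big] P(\nu).$$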



The inner sum is bounded from above by 1 hence we obtain the simple bound

$$P\big(E_d^{(\ell)}\big) \le 2\exp\{-2\ell\gamma\epsilon^2\}. \tag{4.12}$$
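To get a feel for (4.12), the sketch below evaluates $2\exp\{-2\ell\gamma\epsilon^2\}$ and inverts it for $\epsilon$ at a given confidence level. This covers only the fixed-rule bound (4.12), not the final bound of the theorem, which also accounts for the randomness of the learnt model. The function names and example values are assumptions made for illustration.

```python
import math

def fixed_rule_bound(ell, gamma, eps):
    """Right-hand side of (4.12): 2 exp{-2 * ell * gamma * eps^2}."""
    return 2.0 * math.exp(-2.0 * ell * gamma * eps ** 2)

def eps_for_confidence(ell, gamma, delta):
    """Smallest eps for which the right-hand side of (4.12) is at most delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * ell * gamma))

# Hypothetical values: lambda_0 = 0.5, minimal subsequence length 1000, delta = 0.05.
lam0 = 0.5
gamma = (1.0 - lam0) / (1.0 + lam0)
ell, delta = 1000, 0.05
eps = eps_for_confidence(ell, gamma, delta)
print(f"gamma = {gamma:.4f}, eps = {eps:.4f}, bound = {fixed_rule_bound(ell, gamma, eps):.4f}")
```

A larger $\lambda_0$ gives a smaller $\gamma$ and hence a larger $\epsilon$ at the same confidence, which reflects the slower concentration of a more strongly dependent chain.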

Denote by $d$ the binary vector associated with the learnt model $M_k$ (which is based
on a random training sequence $x^{(k+m)}$). We are interested in the probability of the
event $E^{(\ell)}$ that after learning, given a random test sequence $x^{(n)}$ for prediction, the
learner based on the selection rule $\Gamma_d$ selects a subsequence $\xi^{(\nu)}$ from $x^{(n)}$ of length
at least $\ell$ which is biased away from $\beta$ by an amount greater than $\epsilon$.
Denoting by $\bar{A}_{\epsilon,m}$ the complement of the set $A_{\epsilon,m}$, we have



(4:;) - 



d.e 



< 



< 2 



U 



it) 



exp {-2fe^7} + 2exp 



_2fcp(m)g2 



(4.13) 



(4.14) 



where the last inequality follows from (4.7) and (4.12). To have the right side be
no larger than $\delta > 0$ it suffices to ensure that $\epsilon$ satisfies



e < 



and 



.,1 



Now, from 

Hence we may bound 
In 



< 



4(fe) 



< In 



(l + e)2V(™V l-a:^ 



In- 



(l + e)2V("V l-al 
Now, the following bound on the combination number is easy to verify:

$$\binom{n}{k} \le \frac{n^k}{k!}$$

from which we obtain

$$\ln \binom{n}{k} \le k \ln n - \ln k! = k \ln n - \sum_{j=1}^{k} \ln j \le k \Big( \ln\Big(\frac{n}{k}\Big) + 1 \Big)$$



(4.15) 



(4.16) 



where we used $\sum_{j=1}^{k} \ln j \ge \int_1^k \ln x \, dx$. Hence the first term of (4.16) is bounded as

2k 



In 



(i + .)2V"">) - + ('°((TT7v«) + 

< 3.«2'-'(i„(-l,) + 
Now,



2fe - [(1 + e)2V"^l + 1 ~ 2'= - 2*^-1 + 1 2^-1 
hence 

< 2'="^ + 1 < 2'=. 



It follows that the second term of (4.16) is bounded from above by $k$. Therefore
(4.16) is bounded from above by



Hence, for any $0 < \delta < 1$, with confidence at least $1 - \delta$ the deviation
$\big| \frac{1}{\nu} \sum_{j=1}^{\nu} \xi_j^{(\nu)} - \beta \big|$ between the frequency of 1s and $\beta$ of the subsequence $\xi^{(\nu)}$
selected by the rule $\Gamma_d$ based on the learnt model $M_k$ is bounded by

This concludes the proof of Theorem 1.
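The bound on the combination number used in the proof, $\binom{n}{k} \le n^k / k!$, together with its consequence $\ln \binom{n}{k} \le k(\ln(n/k) + 1)$, can be checked by brute force over a small range. The range limit `n_max` is an arbitrary choice for this sketch.

```python
import math

def check_combination_bounds(n_max):
    """Return True if, for all 1 <= k <= n <= n_max,
    C(n, k) <= n^k / k!  and  ln C(n, k) <= k (ln(n / k) + 1)."""
    for n in range(1, n_max + 1):
        for k in range(1, n + 1):
            c = math.comb(n, k)
            # Integer form of C(n, k) <= n^k / k!, which avoids floating-point error.
            if c * math.factorial(k) > n ** k:
                return False
            # Small tolerance for rounding in the logarithms.
            if math.log(c) > k * (math.log(n / k) + 1.0) + 1e-12:
                return False
    return True

print(check_combination_bounds(60))
```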